TEI-Conformant Structural Markup of a Trilingual Parallel Corpus in the ECI Multilingual Corpus 1
Abstract
In this paper we provide an overview of the ACL European Corpus Initiative (ECI) Multilingual Corpus 1 (ECI/MC1). In particular, we look at one subcorpus of the ECI/MC1, the trilingual corpus of International Labour Organisation reports, and discuss the problems involved in TEI-compliant structural markup and preliminary alignment of this large corpus. We discuss gross structural alignment down to the level of text paragraphs. We see this as a necessary first step in corpus preparation before detailed (possibly automatic) alignment of texts is possible. We try to generalise our experience with this corpus to illustrate the process of preliminary markup of large corpora which in their raw state can be in an arbitrary format (e.g. printers' tapes, proprietary word-processor formats); noisy (not fully parallel, with structure obscured by spelling mistakes); full of poorly documented formatting instructions; and whose structure is present but far from explicit. We illustrate these points by reference to other parallel subcorpora of ECI/MC1. We attempt to define some guidelines for the development of corpus annotation toolkits which would aid this kind of structural preparation of large corpora.
1. Overview of the ECI Corpus
1.1. Brief History and Acknowledgements
The ECI arose as a result of a concern shared by a number of European researchers in computational linguistics that waiting for fully funded support for collection and distribution of non-English corpus material would mean waiting too long. This concern crystallised into action, modelled on the Association for Computational Linguistics (ACL) Data Collection Initiative, following a meeting in Pisa sponsored by the Network for European Reference Corpora (NERC) in 1992. The original call for contributions to the ECI described it as follows:
The European Corpus Initiative was founded to oversee the acquisition and preparation of a large multi-lingual corpus to be made available in digital form for scientific research at cost and without royalties.
We believe that widespread easy access to such material would be a great stimulus to scientific research and technology development as regards language and language technology. We support existing and projected national and international efforts to carefully design, collect and publish large-scale multi-lingual written and spoken corpora, but also believe it will be some time before the scientific and material resources necessary to bring these projects to fruition will be found. In the interim, a small and rapid effort to collect and distribute existing material can serve to show the way. No amount of abstract argument as to the value of corpus material is as powerful …
Similar resources
Data in Your Language : The ECI
In this paper we describe the contents and the method of production of the ACL European Corpus Initiative Multilingual Corpus 1 (ECI/MC1). This is a large multilingual electronic text corpus, containing 97 million words in 27 (mainly European) languages. It is available cheaply on CDROM. Most of the texts in the corpus are marked up using a fully-validated SGML document type description based o...
Multilingual Corpora for Cooperation
MLCC was a corpus acquisition project funded by the EC Telematics program. The aim was to collect a set of texts representing a substantial improvement in range, quantity and quality of corpus material available. Two sub-corpora have been defined to help meet the needs for multilingual data consisting of a comparable set of texts in six languages and a parallel set of data in 9 languages. The c...
Lexical Bundles in English Abstracts of Research Articles Written by Iranian Scholars: Examples from Humanities
This paper investigates a special type of recurrent expressions, lexical bundles, defined as a sequence of three or more words that co-occur frequently in a particular register (Biber et al., 1999). Considering the importance of this group of multi-word sequences in academic prose, this study explores the forms and syntactic structures of three- and four-word bundles in English abstracts writte...
In Memoriam: Susan Armstrong
Susan Armstrong worked as Professor of Translation Technology at the University of Geneva until her retirement in 2014. She served as secretary to the European chapter of the ACL from 1993–2000, remaining on the chapter’s nominating committee until 2004. She had a fundamental role in the founding and successful development of SIGDAT. Susan arrived in Switzerland from the United States in 1978 t...
The MULTEXT-East corpus
The EU MULTEXT-East project has produced harmonised language resources for Bulgarian, Czech, Estonian, Hungarian, Romanian, and Slovene. In this paper we introduce the MULTEXT-East multilingual corpus, which comprises marked-up texts in the six languages totaling approximately 2 million words and a small speech corpus. The corpus is encoded in SGML, in the TEI-like Corpus Encoding Specification...
Publication date: 1994